Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add leftsemi merge #57979

Closed
wants to merge 4 commits into from
Closed

ENH: Add leftsemi merge #57979

wants to merge 4 commits into from

Conversation

phofl
Copy link
Member

@phofl phofl commented Mar 24, 2024

I want to add anti joins as a follow up, which can be incorporated into the new class

@phofl phofl added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Enhancement labels Mar 24, 2024
khiter_t k
Int64Vector locs = Int64Vector()
Int64Vector self_locs = Int64Vector()
Int64VectorData *l
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistic nit but would be better to not use single character variable names, especially those that conflict with debugger aliases. The cython debugger is hard enough to use as is :-)

Int64Vector self_locs = Int64Vector()
Int64VectorData *l
Int64VectorData *sl
# mask not implemented
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we raise NotImplementedError here for now?


for i in range(n):
val = values[i]
hash(val)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the point of this just to raise an error if an object is not hashable?

val = values[i]
hash(val)

k = kh_get_pymap(self.table, <PyObject*>val)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the cast required here? I assume it would know we are already working with PyObjects given the declaration of values

@@ -1683,7 +1681,7 @@ def get_join_indexers(
left_keys: list[ArrayLike],
right_keys: list[ArrayLike],
sort: bool = False,
how: JoinHow = "inner",
how: JoinHow + Literal["leftsemi"] = "inner",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason why we don't just add leftsemi to the JoinHow definition?

tm.assert_frame_equal(result, expected.head(1))


def test_leftsemi_invalid():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test for raising NotImplementederror when passing a mask?

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerned about whether this argument will apply to DataFrame.join().

@@ -326,6 +326,11 @@
join; sort keys lexicographically.
* inner: use intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys.
* leftsemi: Filter for rows in the left that have a match on the right;
preserve the order of the left keys. Doesn't support `left_index`, `right_index`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
preserve the order of the left keys. Doesn't support `left_index`, `right_index`,
preserve the order of the left keys. Does not support `left_index`, `right_index`,

@@ -447,7 +447,7 @@ def closed(self) -> bool:
AnyAll = Literal["any", "all"]

# merge
MergeHow = Literal["left", "right", "inner", "outer", "cross"]
MergeHow = Literal["left", "right", "inner", "outer", "cross", "leftsemi"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MergeHow is also used as the type for how in DataFrame.join().

So if DataFrame.join() will not support the "leftsemi" operation, then you'll need to split this type into a JoinHow and MergeHow, with the latter dependent on the former.

If DataFrame.join() will support that operation, then you'll have to update the docs for join and add a test for it.

Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 26, 2024
@mroeschke
Copy link
Member

Closing to clear the queue, but feel free to update and reopen anytime

@mroeschke mroeschke closed this Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Support Left Semi Joins
4 participants